# Stroke prediction report

This report covers stroke prediction. The original data came from the Coursera course “Build and deploy a stroke prediction model using R”.

Stroke is a leading cause of death and disability worldwide, with significant public health implications.

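The dataset came from Kaggle via Coursera and includes patient information such as gender, age, hypertension, heart disease, marital status, work type, residence type, average glucose level, BMI, smoking status, and stroke occurrence.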

The dataset consists of 5110 observations, 249 of which are patients who experienced a stroke. The classes are therefore strongly imbalanced: only 4.87% of observations involve a stroke.
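The stroke share quoted above follows directly from the two counts. A minimal sketch in base R, using only the figures reported in this section (the variable names are illustrative and not taken from the original analysis):

```r
# Counts reported in the text
n_total  <- 5110   # all observations
n_stroke <- 249    # observations with stroke == 1

# Share of the positive class, in percent
stroke_pct <- 100 * n_stroke / n_total
round(stroke_pct, 2)

# With the data frame loaded, the same figure could be obtained with:
# round(100 * mean(stroke$stroke == 1), 2)
```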


Some general information about the dataset:

dim(stroke) 
## [1] 5110   12
head(stroke) 
##      id gender age hypertension heart_disease ever_married     work_type
## 1  9046   Male  67            0             1          Yes       Private
## 2 51676 Female  61            0             0          Yes Self-employed
## 3 31112   Male  80            0             1          Yes       Private
## 4 60182 Female  49            0             0          Yes       Private
## 5  1665 Female  79            1             0          Yes Self-employed
## 6 56669   Male  81            0             0          Yes       Private
##   Residence_type avg_glucose_level   bmi  smoking_status stroke
## 1          Urban            228.69 36.60 formerly smoked      1
## 2          Rural            202.21 28.89    never smoked      1
## 3          Rural            105.92 32.50    never smoked      1
## 4          Urban            171.23 34.40          smokes      1
## 5          Rural            174.12 24.00    never smoked      1
## 6          Urban            186.21 29.00 formerly smoked      1
str(stroke) 
## 'data.frame':    5110 obs. of  12 variables:
##  $ id               : int  9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
##  $ gender           : chr  "Male" "Female" "Male" "Female" ...
##  $ age              : num  67 61 80 49 79 81 74 69 59 78 ...
##  $ hypertension     : int  0 0 0 0 1 0 1 0 0 0 ...
##  $ heart_disease    : int  1 0 1 0 0 0 1 0 0 0 ...
##  $ ever_married     : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ work_type        : chr  "Private" "Self-employed" "Private" "Private" ...
##  $ Residence_type   : chr  "Urban" "Rural" "Rural" "Urban" ...
##  $ avg_glucose_level: num  229 202 106 171 174 ...
##  $ bmi              : num  36.6 28.9 32.5 34.4 24 ...
##  $ smoking_status   : chr  "formerly smoked" "never smoked" "never smoked" "smokes" ...
##  $ stroke           : int  1 1 1 1 1 1 1 1 1 1 ...
names(stroke)
##  [1] "id"                "gender"            "age"              
##  [4] "hypertension"      "heart_disease"     "ever_married"     
##  [7] "work_type"         "Residence_type"    "avg_glucose_level"
## [10] "bmi"               "smoking_status"    "stroke"
summary(stroke)
##        id           gender               age         hypertension    
##  Min.   :   67   Length:5110        Min.   : 0.08   Min.   :0.00000  
##  1st Qu.:17741   Class :character   1st Qu.:25.00   1st Qu.:0.00000  
##  Median :36932   Mode  :character   Median :45.00   Median :0.00000  
##  Mean   :36518                      Mean   :43.23   Mean   :0.09746  
##  3rd Qu.:54682                      3rd Qu.:61.00   3rd Qu.:0.00000  
##  Max.   :72940                      Max.   :82.00   Max.   :1.00000  
##  heart_disease     ever_married        work_type         Residence_type    
##  Min.   :0.00000   Length:5110        Length:5110        Length:5110       
##  1st Qu.:0.00000   Class :character   Class :character   Class :character  
##  Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.05401                                                           
##  3rd Qu.:0.00000                                                           
##  Max.   :1.00000                                                           
##  avg_glucose_level      bmi        smoking_status         stroke       
##  Min.   : 55.12    Min.   :10.30   Length:5110        Min.   :0.00000  
##  1st Qu.: 77.25    1st Qu.:23.80   Class :character   1st Qu.:0.00000  
##  Median : 91.89    Median :28.40   Mode  :character   Median :0.00000  
##  Mean   :106.15    Mean   :28.89                      Mean   :0.04873  
##  3rd Qu.:114.09    3rd Qu.:32.80                      3rd Qu.:0.00000  
##  Max.   :271.74    Max.   :97.60                      Max.   :1.00000
skim(stroke)
| Data summary | |
|---|---|
| Name | stroke |
| Number of rows | 5110 |
| Number of columns | 12 |
| Column type frequency: | |
| character | 5 |
| numeric | 7 |
| Group variables | None |

Variable type: character

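| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| gender | 0 | 1 | 4 | 6 | 0 | 3 | 0 |
| ever_married | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| work_type | 0 | 1 | 7 | 13 | 0 | 5 | 0 |
| Residence_type | 0 | 1 | 5 | 5 | 0 | 2 | 0 |
| smoking_status | 0 | 1 | 6 | 15 | 0 | 4 | 0 |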

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 36517.83 | 21161.72 | 67.00 | 17741.25 | 36932.00 | 54682.00 | 72940.00 | ▇▇▇▇▇ |
| age | 0 | 1 | 43.23 | 22.61 | 0.08 | 25.00 | 45.00 | 61.00 | 82.00 | ▅▆▇▇▆ |
| hypertension | 0 | 1 | 0.10 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| heart_disease | 0 | 1 | 0.05 | 0.23 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| avg_glucose_level | 0 | 1 | 106.15 | 45.28 | 55.12 | 77.24 | 91.88 | 114.09 | 271.74 | ▇▃▁▁▁ |
| bmi | 0 | 1 | 28.89 | 7.70 | 10.30 | 23.80 | 28.40 | 32.80 | 97.60 | ▇▇▁▁▁ |
| stroke | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |

The average age in the study group is 43.23 years, and the average glucose level is 106.15. One person in the study group reported the gender “other”. This record was removed to make the data visualizations more readable, leaving 5,109 observations.
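Dropping a single category like this is a one-line row filter in base R. The sketch below uses a small made-up data frame, and the exact label `"Other"` is an assumption about how the value is spelled in the `gender` column:

```r
# Toy data standing in for the real data frame (values are made up)
toy <- data.frame(
  id     = 1:5,
  gender = c("Male", "Female", "Other", "Female", "Male"),
  stringsAsFactors = FALSE
)

# Keep every row whose gender is not "Other" (label assumed)
cleaned <- toy[toy$gender != "Other", ]

nrow(cleaned)  # one row fewer than the input
```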

Missing values were found in the “bmi” column. They were replaced with the mean of all available BMI measurements, rounded to two decimal places.
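The mean imputation described here could look roughly like the following. This is a sketch on made-up values, not the code used for the report; it assumes the missing entries are already encoded as `NA`:

```r
# Toy data frame standing in for the real one (values are made up)
toy <- data.frame(bmi = c(36.6, NA, 32.5, 34.4, NA, 29.1))

# Mean of the observed values only, rounded to two decimal places
bmi_mean <- round(mean(toy$bmi, na.rm = TRUE), 2)

# Replace only the missing entries; observed values stay untouched
toy$bmi[is.na(toy$bmi)] <- bmi_mean

toy$bmi
```

Note that mean imputation leaves the column mean unchanged but shrinks its variance, which is worth keeping in mind when the imputed column is later used in a model.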

The charts below show the distribution of demographic variables: gender, residence type, and work type.

The charts below show the distribution of health-related variables: hypertension, heart disease, and smoking status.

BMI values were grouped according to the “BMI Classification Percentile And Cut Off Points” classification. The chart below shows the distribution of BMI categories in the study group.
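Grouping a numeric column into categories like this can be done with base R's `cut()`. The cut-offs and labels below are the commonly used adult thresholds (18.5, 25, 30) and are an assumption: the report does not list the exact cut points or category names it applied, and the source classification may subdivide obesity further.

```r
# Example BMI values (made up for illustration)
bmi <- c(17.2, 22.0, 25.0, 28.4, 29.9, 30.0, 41.3)

# Assumed cut-offs: <18.5, 18.5 to <25, 25 to <30, >=30.
# right = FALSE makes each interval closed on the left: [lower, upper)
bmi_class <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal weight", "Overweight", "Obesity"),
  right  = FALSE
)

# Number of values per category
table(bmi_class)
```

Using `right = FALSE` matters at the boundaries: a BMI of exactly 25.0 lands in "Overweight" and exactly 30.0 in "Obesity".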

As the chart shows, overweight is the most common BMI category among the patients. This is in line with worrying reports from other studies on overweight. According to the WHO (https://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight):

“In 2022, 2.5 billion adults (18 years and older) were overweight. Of these, 890 million were living with obesity. In 2022, 43% of adults aged 18 years and over were overweight and 16% were living with obesity.”

In the studied group there are 1610 overweight patients, which is about 31.5% of the entire group.

The chart below shows the distribution of patients with and without stroke depending on their smoking status.

| Smoking status | Patients without stroke | Patients with stroke |
|---|---|---|
| formerly smoked | 814 | 70 |
| never smoked | 1802 | 90 |
| smokes | 747 | 42 |
| Unknown | 1497 | 47 |
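Because the groups differ a lot in size, raw counts are hard to compare. A short sketch that turns the counts from the table above into within-group percentages (the object names are illustrative):

```r
# Counts copied from the table above
smoking <- matrix(
  c( 814, 70,
    1802, 90,
     747, 42,
    1497, 47),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    smoking_status = c("formerly smoked", "never smoked", "smokes", "Unknown"),
    stroke         = c("no", "yes")
  )
)

# Row-wise proportions: share of each smoking group without and with stroke
stroke_share <- prop.table(smoking, margin = 1)

round(100 * stroke_share, 2)
```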

The chart below shows the distribution of patients with and without hypertension depending on their smoking status.

| Smoking status | No hypertension | Hypertension |
|---|---|---|
| formerly smoked | 765 | 120 |
| never smoked | 1660 | 232 |
| smokes | 695 | 94 |
| Unknown | 1492 | 52 |

The chart below shows the distribution of patients with and without stroke depending on the occurrence of hypertension.

| | No hypertension | Hypertension |
|---|---|---|
| No stroke | 4429 | 432 |
| Stroke | 183 | 66 |
A relationship between age and glucose levels has been observed. The chart below shows this relationship, including stroke patients and non-stroke patients alike.

The chart shows that high glucose levels are more common in older patients. It also distinguishes patients with a diagnosed stroke (orange dots) from those without (green dots). The number of stroke patients increases with age and with glucose level. The correlations, though weak, are consistent with this:

- age and glucose: r=0.24, p<0.001;
- age and stroke: r=0.25, p<0.001;
- stroke and glucose: r=0.12, p<0.001.
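Correlations of this kind can be computed with `cor.test()` from base R's stats package. The sketch below runs on simulated data, not on the study data, and assumes Pearson correlation (the default); the report does not state which method was used. With a 0/1 variable such as stroke, the Pearson coefficient is the point-biserial correlation.

```r
set.seed(1)  # reproducible toy data, NOT the study data

n       <- 200
age     <- runif(n, 20, 80)
glucose <- 80 + 0.5 * age + rnorm(n, sd = 30)
stroke  <- rbinom(n, 1, plogis(-6 + 0.07 * age))

# Pearson correlation between two numeric variables
res <- cor.test(age, glucose)
res$estimate   # correlation coefficient r
res$p.value    # p-value of the test that r = 0

# Same call with a binary variable (point-biserial correlation)
cor.test(age, stroke)$estimate
```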

Average blood glucose levels also differed between the sexes. This is illustrated in the two graphs below, which distinguish between stroke and non-stroke patients.
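Group means like the ones behind such graphs can be obtained with `aggregate()`. A minimal sketch on made-up values, reusing the column names from the dataset:

```r
# Toy data (values are made up for illustration)
toy <- data.frame(
  gender            = c("Male", "Male", "Female", "Female", "Male", "Female"),
  stroke            = c(0, 1, 0, 1, 0, 0),
  avg_glucose_level = c(90, 200, 85, 170, 110, 95)
)

# Mean glucose level for every gender x stroke combination
means <- aggregate(avg_glucose_level ~ gender + stroke, data = toy, FUN = mean)
means
```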

# Conclusions

# Future studies

Future research should take into account more factors related to health behaviors, such as eating habits, physical activity, and alcohol consumption. Clinical indicators such as waist-to-height ratio, visceral adipose tissue, and triglyceride-glucose index should also be considered, as should the use of blood-thinning medications.